Understanding differences between what alternate propensity score methods estimate

BACKGROUND: Many approaches to propensity score methods are used in the applied health economics and outcomes research literature. Often this creates confusion when different approaches produce different results for the same data.
OBJECTIVE: To present a conceptual overview based on a potential outcomes framework to demonstrate how more than 1 mean treatment effect parameter can be estimated using propensity score methods and how the selection of appropriate methods should align with the scientific questions.
METHODS: We highlight that more than 1 mean treatment effect parameter can be estimated using propensity score methods. Using the potential outcomes framework and alternate data-generating processes, we discuss under what assumptions different mean treatment effect parameters are expected to differ. We tie these discussions to propensity score methods to show that different approaches may estimate different parameters. We illustrate these methods using a case study of the comparative effectiveness of apixaban vs warfarin on the likelihood of stroke among patients with a prior diagnosis of atrial fibrillation.
RESULTS: Different mean treatment effect parameters take on different values when treatment effects are heterogeneous. We show that traditional propensity score approaches, such as blocking, weighting, matching, or doubly robust estimation, can estimate different mean treatment effect parameters. Therefore, they may not produce the same results even when applied to the same data using the same covariates. We found significant differences in our case study estimates of mean treatment effect parameters. Still, once a mean treatment effect parameter is targeted, estimates across different methods do not differ. This highlights the importance of first selecting the target parameter for analysis by aligning the interpretation of the target parameter with the scientific questions and then selecting the specific method to estimate this target parameter.
CONCLUSIONS: We present a conceptual overview of propensity score methods in health economics and outcomes research from a potential outcomes framework. We hope these discussions will help applied researchers choose appropriate propensity score approaches for their analysis.


Plain language summary
The propensity score is the estimated probability of treatment assignment as a function of covariates, and it can be used in different ways to compare outcomes across 2 or more treatments in an observational data setting. Different propensity score approaches compare treatments for slightly different subgroups of patients, generating different estimates even when applied to the same data and covariates. We provide broad discussions of these issues so that applied researchers can choose the appropriate methods.

Implications for managed care pharmacy
Any application may have different types of comparative effectiveness questions, and the answers to these questions could be different. A widely popular method, the propensity score method, used to answer these questions can be applied using many different approaches. These approaches can produce different estimates when applied to the same data and covariates. Pharmacy and therapeutics committees in managed care pharmacies should be aware of these methodological differences to interpret results correctly.
Propensity score (PS) methods are popular statistical methods used to generate comparative effectiveness estimates from observational data. A PS is the estimated probability of treatment assignment as a function of covariates and can be used in different ways to compare outcomes across 2 or more treatments in an observational data setting. There are many different approaches through which the estimated PS can be employed to generate comparative effects. However, these approaches usually identify the comparative effects for slightly different subgroups of patients, generating different estimates even when applied to the same data and covariates. In this paper, we provide a guide for selecting between different PS methods. Our goal is not to review the practical details of implementing any specific PS method; there are plenty of reviews that do that. [1][2][3][4][5] Our goal is to highlight the ideas of what each statistical method is trying to estimate, ie, what is the statistical method's "target parameter"? And, importantly, how does this differ from the target parameter we want to know to inform our decisions? Only then may we be able to infer whether a specific statistical method can estimate our preferred target accurately. We later give examples of target parameters and how they could differ from what a decision-maker wants an answer for. We start with some context of why these methods are used in the first place.
When outcomes observed in individuals receiving 2 different interventions (or treatments) are compared in observational data, statistical adjustment methods are needed because individuals receiving different treatments may differ in their baseline characteristics, which can also affect outcomes. A naive comparison of outcomes between the 2 intervention groups produces a biased assessment of the effect of 1 intervention over the other because some of the differences in outcomes may have been caused by differences in the characteristics of individuals receiving different interventions (Figure 1). This phenomenon is referred to as confounding by indication or confounding bias. Statistical methods are required to address these confounding biases. There is extensive methods literature that deals with confounding biases. [6][7][8] Confounding bias can be due to any baseline characteristics that affect outcomes, irrespective of whether those characteristics are measured in the data at hand. The statistical world is split between 2 genres of methods: one that attempts to address all confounding biases emanating from any characteristics, measured or unmeasured in the data at hand, and the other that "assumes" that there is no unmeasured confounding bias and all confounding biases are due to characteristics that are measured in the data at hand. The former set of methods provides the clearest causal interpretation of comparative effects; however, implementing such statistical methods can be quite daunting and sometimes impossible. Methods dealing with unobserved confounding can be found elsewhere. [9] The latter set of methods is much easier to implement but carries the causal interpretation for its estimated effects only under the assumption of no unmeasured confounding. In other words, these methods invoke the assumption of "overt" confounding bias, which assumes that the entirety of these confounding biases is driven by factors observed by the analyst (Figure 1A). This assumption has several names: selection on observables, exogeneity, or unconfoundedness. When such an assumption is invoked in an analysis, regression methods (Figure 1B), PS methods (Figure 1C), or their combinations can be used to address the (overt) confounding biases, and the estimated comparative effects are interpreted as causal in nature.
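The contrast between a naive comparison and the 2 families of adjustment named above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the analysis used in this paper: the data-generating process, the single measured confounder `x`, the constant treatment effect of 1.0, and the helper `fit_logistic` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_logistic(design, d, iters=25):
    """Fit a logistic regression by Newton-Raphson (hypothetical helper)."""
    theta = np.zeros(design.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-design @ theta))
        gradient = design.T @ (d - p)
        hessian = (design * (p * (1.0 - p))[:, None]).T @ design
        theta = theta + np.linalg.solve(hessian, gradient)
    return theta


# Assumed data-generating process: one measured confounder x raises both the
# chance of treatment and the outcome; the true treatment effect is 1.0.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 1.0 * d + 2.0 * x + rng.normal(size=n)

# Naive comparison of mean outcomes, ignoring x.
naive = y[d == 1].mean() - y[d == 0].mean()

# Regression adjustment: ordinary least squares of y on an intercept, d, and x.
ols_design = np.column_stack([np.ones(n), d, x])
regression_adjusted = np.linalg.lstsq(ols_design, y, rcond=None)[0][1]

# PS weighting: estimate the PS, then weight each observation by the inverse
# of the estimated probability of the treatment it actually received.
ps_design = np.column_stack([np.ones(n), x])
ps_hat = 1.0 / (1.0 + np.exp(-ps_design @ fit_logistic(ps_design, d)))
w_treated = d / ps_hat
w_untreated = (1 - d) / (1.0 - ps_hat)
ps_weighted = (np.sum(w_treated * y) / np.sum(w_treated)
               - np.sum(w_untreated * y) / np.sum(w_untreated))

print(f"naive difference:     {naive:.2f}")
print(f"regression adjusted:  {regression_adjusted:.2f}")
print(f"PS weighted:          {ps_weighted:.2f}")
```

In this simulated setting the naive difference sits well above the true effect, whereas both adjusted estimates fall close to it. Because the simulated effect is the same for everyone, the 2 adjustment approaches target the same value here; that need not hold once effects vary across individuals.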
Throughout this paper, we discuss PS methods; hence, any application of such methods invokes the overt confounding bias assumption. However, as mentioned above, even with this assumption in place, different PS approaches usually identify the comparative effects for slightly different subgroups of patients, generating different estimates even when applied to the same data and covariates. The population subgroup of patients most relevant for decision-making defines our target parameter of interest. However, the target parameter that is estimated by a particular method may not reflect that subgroup of interest. Consequently, results could be misleading to the decision-making at hand. A central essence of our discussions is the notion of treatment effect heterogeneity, in which the comparative effects differ based on levels of baseline characteristics. In the presence of such treatment effect heterogeneity, the average effects for different subgroups can differ and, therefore, understanding whether a specific approach is estimating the average effects for the subgroup of interest could be relevant. This issue has less commonly been discussed in the applied health economics and outcomes research literature. We discuss these concepts using the framework of potential outcomes, which is where we start in the next section. We then illustrate these concepts using a case study of the comparative effectiveness of apixaban vs warfarin on the likelihood of stroke among patients with a prior diagnosis of atrial fibrillation.
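To make the role of treatment effect heterogeneity concrete, the sketch below simulates individual-level effects directly, which is possible only in simulation because both potential outcomes are never observed for the same person. All names and the data-generating process are illustrative assumptions. It compares the average effect over everyone, over those who receive treatment, and over those who do not (the ATE, ATT, and TUT parameters defined in the next section).

```python
import numpy as np


def mean_effects(tau, d):
    """Average an individual-level effect tau over three populations."""
    return {
        "ATE": tau.mean(),           # everyone
        "ATT": tau[d == 1].mean(),   # those who received treatment
        "TUT": tau[d == 0].mean(),   # those who did not
    }


# Assumed setup: a single characteristic x drives treatment receipt.
rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))

# Case 1: the effect is the same for everyone.
constant = mean_effects(np.full(n, 1.0), d)

# Case 2: the effect grows with x, the same characteristic that makes
# treatment more likely, so treated individuals benefit more on average.
heterogeneous = mean_effects(1.0 + x, d)

for label, result in [("constant", constant), ("heterogeneous", heterogeneous)]:
    summary = ", ".join(f"{name}={value:.2f}" for name, value in result.items())
    print(f"{label}: {summary}")
```

With a constant effect the 3 averages coincide, so any method that recovers one of them recovers all. With heterogeneous effects they separate, and a method that averages over the treated answers a different question than one that averages over the whole population.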

UNDERSTANDING DIFFERENT TARGET PARAMETERS FOR DECISION-MAKING
It is important to define a target parameter for one's query before comparative analysis begins. A target parameter is defined at the population level. As analysts, we usually have a sample from this population to obtain estimates for these parameters. The default notion is that we are interested in the Average Treatment Effect (ATE) parameter because randomized clinical trials (RCTs) attempt to estimate this parameter, assuming representative enrollment in trials and other general assumptions. [10] The ATE for a new treatment "T" vs a standard treatment "S" would represent the difference in the average outcomes if everyone in the population received T vs if everyone received S. However, ATE is not the only useful parameter for health care and policy decision-making. [11] Suppose the goal is to study the effect of a certain legal policy or tax policy that affects everyone in the population. In that case, it is worthwhile to know about the average effect of that policy on the overall population. However, other mean treatment effect parameters may be useful even in this case. For example, suppose a state wants to mandate the provision of maternal and child health services (MCHSs) for all low-income households in the state. In that case, the legislature may first want to know what is the ATE of MCHS receipt, ie, what would happen if all target households get these services vs if none of them get them. This is a complicated parameter to estimate. Only through an RCT or another natural experiment going on in a different state that can effectively randomize a representative sample of the target population can one try to estimate this parameter. The legislature may also want to know who benefits and does not benefit from the MCHS services so they can decide to target their mandate to specific subpopulations. These questions can be informed using Conditional ATE (CATE) parameters, which involve conditioning on particular household characteristics that can be easily targeted from a policy tool perspective. In addition, the legislature may also want to know the average effect of receiving MCHSs among those who actually receive MCHSs in the real world. This parameter is called the ATE on the Treated (ATT). [12] It is different from the ATE because only a selected group of households choose to receive these services, and their average effect could be different because of the heterogeneity in treatment effects. Many economists have argued that decisions about the provision of services should be based on ATT rather than ATE because the former parameter is more likely to represent the ex-post realization of outcomes with the provision. [11] The provision of services would not affect households who never choose to receive those services, with or without the provision. Similarly, the ATE on the Untreated (TUT) may also be relevant in decision-making. In this context, it represents the average effect of receiving MCHSs among those who currently do not get MCHSs. A large positive TUT for MCHS would suggest investing in a generous implementation policy to increase the uptake of these services even after access is provided.
Once the state decides to mandate the provision of MCHSs for all low-income households in the state, one can study a different ATE of the provision itself. What happens to all target households in the state with and without the provision? This is often known as the average policy effect, which is the policy's ATE. In addition, one may also want to know the average effect of MCHS for those who were induced to receive MCHSs because of the policy. This is often called the Local ATE (LATE) or the Average Complier Effect. LATE, or the Average Complier Effect, differs from ATT because whereas ATT represents the average effect among all MCHS recipients, LATE estimates the average effect only among those new recipients induced by the policy change.
Extending the MCHS example to other decision-making contexts, such as coverage decisions for new technology, is straightforward. Many other nuanced mean treatment effect parameters are available in the economics and statistical literature. It is important to note that the fundamental difference in the definition of these mean treatment effect parameters arises because of differences in the population subgroups for which the average effects are sought. Therefore, determining which parameter to target in a statistical analysis should depend on the decision that the analysis seeks to inform.

POTENTIAL OUTCOMES FRAMEWORK AND DATA-GENERATING PROCESSES
A potential outcome corresponds to the outcome that could be realized had an individual received a specific intervention. [13] In reality, though, only 1 of the potential outcomes for an individual is observed in the data. That is, the observed outcome is the potential outcome that corresponds to the intervention that this individual receives in real life. This incompleteness in information for each individual sets up the challenge of figuring out the counterfactual potential outcome had this individual received a different intervention. Such incompleteness necessitates statistical estimation. However, potential outcomes help us define the parameter that we intend to estimate with our statistical machinery.
For example, if we take the difference between the 2 potential outcomes and average it over all individuals in a population, we will get what is known as the ATE parameter. The absolute levels of these potential outcomes for any individual could naturally depend on other characteristics of this individual, eg, one's age, sex, comorbidities, etc. If the effect of individual characteristics on potential outcomes is the same for all intervention states, then even if we wanted the average effect of intervention T vs S for a specific subgroup of the population (eg, one that is defined by specific levels of their characteristics), the CATE would be identical to the ATE. This implies that we have conceptualized a potential outcomes model in which the average effect of intervention T over S is constant across all groups of people in the population. Such conceptualization can be deemed unrealistic in most real comparison settings. Still, it immediately simplifies the interpretation of the mean intervention effect that any statistical method is trying to estimate: all target parameters have the same value. No matter what subpopulation data a statistical method relies on to estimate a mean intervention effect, it would always estimate the ATE. Consequently, it implies that any mean target parameter that we may prefer for our decision-making would have the same value as the ATE, and results from any statistical method would suffice to answer our needs. This luxury obviously goes away when we make a more realistic assumption for our data-generating process for potential outcomes, in which we allow the effect of individual characteristics on potential outcomes to be different for different intervention states. This implies that the average effect of intervention T over S will differ for different population subgroups. Under this data-generating process, it is not expected that different statistical methods, which often identify different target parameters by focusing on slightly different subgroups of patients, would all produce the same estimate. More importantly, as discussed in the section above, it becomes necessary to have a more nuanced understanding of what target parameter we need for decision-making to select the proper statistical method.

ESTIMATION OF PSs
A PS is the probability of treatment receipt as a function of covariates. The PS is denoted as e(X) = Pr(D = 1 | X). [14] A typical dataset that we collect from an individual will consist of 3 elements: data_i ∈ {Y_i, X_i, D_i}, where Y is the observed outcome, X is a vector of characteristics for individuals that affect outcomes, and D is an indicator identifying whether the individual received intervention T (D = 1) or S (D = 0). The true PS, e(X), is not directly observed in the data. Hence, in the first part of any PS method, PS is estimated from the data at hand. The PS e(X) is estimated from a model for the likelihood of treatment, such as a logistic regression model, log[e(X)/{1-e(X)}] = X·θ. Let the estimated PS be denoted with a hat: ê(X). As long as this PS estimation model is overspecified to approximate a fully saturated regression model, the estimated PS can be used in place of the true PS to invoke the properties of the true PS. In practice, this is usually achieved by allowing all potential interactions and at least second-order polynomials of all covariates in the PS regression model. [15] Overfitting is not a problem in PS estimation as long as due diligence is applied in selecting covariates that are most likely risk factors for the outcomes. However, which factors are truly confounders and which have enough information in the data to be used in modeling (eg, a very rare comorbidity indicator) should be carefully determined based on substantive theory and empirical exercise. [16] More recently, advanced machine learning methods, [17] like boosted trees, have been used to estimate the PSs to identify the most influential confounders and also produce consistent estimates of PSs.
Once the PS is estimated, there are important checks one should conduct regarding overlapping support, trimming, etc. There are plenty of reviews that cover these topics. [1][2][3][4][5] However, as mentioned before, this paper focuses on how to use the estimated PS. There are various ways through which the balancing property of PSs is used in practice, including weighting by the reciprocal of PSs, blocking on PSs, regression on PSs, and matching on PSs. However, each approach uses a slightly different set of observations (or weights the same observation differently) to estimate the specific mean treatment effect parameter. This implies that the target parameters identified by these approaches could be different, and they are expected to produce different treatment effect estimates in the presence of treatment effect heterogeneity.

ALTERNATIVE METHODS FOR USING ESTIMATED PSs
Stratifying by Quintiles of PSs. This is one of the most commonly used methods in health services research. [18] Here, the empirical distribution of the estimated PS across the entire sample (including treated and untreated participants) is divided into quintiles. Indicator variables for the first 4 quintiles are then used as covariates, along with the treatment indicator and the interactions between them, in an ordinary least squares regression. [19][20][21] Averaging the treatment effects across all quintile subgroups provides an estimate of the ATE. See the Supplementary Materials (available in online article) for technical details.
Inverse Weighting with PSs. In this method, the difference in the weighted average of the outcomes between the treated and untreated groups gives a consistent estimate of the different mean treatment effect parameters depending on how the weights are constructed and to which observations these weights are applied. [22] See the Supplementary Materials for technical details.

Matching with PSs.
There is a large literature on matching on PSs to control for bias due to observable variables. [23] Matching estimators are also nonparametric, but unlike the weighting estimators shown above, matching estimators are less sensitive to the parametric specification of the PS. [24] Because the number of available matching estimators is quite large, we select 3 of the most commonly used matching estimators in practice [25]: (1) a nearest neighbor (with and without a caliper) and radius matching estimator, (2) a kernel-based (using the Epanechnikov kernel) matching estimator, and (3) a local-linear regression-based matching estimator that uses a tricube kernel.
Nearest neighbor (with and without a caliper) and radius matching estimator. According to the theory, if the matching of an estimated PS from treated and untreated groups is performed based on some arbitrarily small neighborhood (ε) of ê(X_i), then the joint distribution of X is still approximately the same for the treated sample and the untreated sample within the neighborhood. See the Supplementary Materials for technical details about different ways such matching can be implemented.
To estimate ATE, one needs to find a match for each observed PS for the treated group and for each observed PS for the untreated group. Matching is usually carried out with replacement (ie, the same observation can be used for matching multiple times) and can be 1:1 (especially for nearest neighbor estimators) or 1:m (for radius estimators). Matching related to ATE estimation can be carried out for subgroups of patients with specific levels of X to obtain estimates of CATE.
To estimate ATT, one needs only to find a match for each observed PS for the treated group. Similarly, one needs only to find a match for each observed PS for the untreated group to estimate TUT.

Kernel-based or local-linear regression-based matching estimators.
These methods are similar to the above matching methods in that they typically implement a 1:m matching based on a specified bandwidth. However, they differ from the above in that not all matched observations get the same weight when averaging across them. The weights are calculated based on a specific kernel or the local-linear regression. See the Supplementary Materials for technical details.
CATE, ATT, and TUT can be estimated using the same principles as in nearest-neighbor and radius matching estimators.

Doubly Robust Estimators.
In this case, the inverse PS weighting and adjustment by regression modeling approach are combined to estimate the ATE. This can be thought of as an augmented version of the inverse PS weighting model. [26] The augmentation serves to increase estimator efficiency. Another advantage of doubly robust estimators is their consistency when 1 of the 2 models is misspecified. See the Supplementary Materials for technical details.
The mean treatment parameter estimate will be consistent if either the PS model or the regression model is correctly specified. This property gives rise to the double robustness designation.

AN EMPIRICAL CASE STUDY
Atrial fibrillation is the irregular or rapid beating of the heart's upper chamber. This fast or irregular heartbeat can result in the formation of blood clots because of the incomplete evacuation of blood from the heart chamber, which could travel to other parts of the body. Certain factors, such as advanced age, alcohol consumption, underlying heart disease, other chronic conditions, and family history, are associated with an increased likelihood of atrial fibrillation. A complication of this condition is an increased risk of stroke.
A number of prescription medications, known as anticoagulants, have been shown to reduce stroke risk in patients with atrial fibrillation. Two such medications are apixaban and warfarin. In this case study, we compare populations with a prior diagnosis of atrial fibrillation receiving apixaban vs those receiving warfarin. The likelihood of a stroke in the year after beginning the prescription is the measure of interest.
The study uses data from the Truven Health Marketscan Databases, which capture person-specific data on clinical utilization, expenditure, and insurance coverage. In particular, we use the Commercial Claims and Encounters and the Medicare Supplemental and Coordination of Benefits databases. The Commercial Claims and Encounters database is a collection of inpatient, outpatient, and pharmaceutical claims from commercial insurance carriers. The Medicare Supplemental and Coordination of Benefits database covers claims from Medicare-eligible retirees. The data in this study cover the 2015 to 2017 time frame.
The selected sample includes those with a prescription of apixaban or warfarin who are aged at least 18 years. These individuals are covered by a commercial or Medicare insurance plan in the year before the anticoagulant prescription and had been previously diagnosed with atrial fibrillation. Certain groups, such as those with a diagnosis of venous thromboembolism and those with more than 1 oral anticoagulant prescription, are excluded. The occurrence of a stroke in the year following oral anticoagulant prescription is the outcome of interest. Data on patient demographics are also included in the final dataset. The final sample consists of 85,669 observations. The summary statistics for the full sample and those by prescription are presented in Table 1.
Approximately 13% of the sample reported a stroke in the year after prescription (15% receiving warfarin and 11% receiving apixaban). The apixaban group is also younger, more likely to be in the workforce, more likely to be the primary insurance beneficiary, and more likely to be female than patients receiving warfarin. On the other hand, a lower proportion of patients receiving apixaban are enrolled in Medicare.
There is also variation in the clinical characteristics by prescription. A total of 12.3% of those receiving apixaban report a depression diagnosis compared with 10.3% of the warfarin group. Those receiving apixaban are also more likely to report alcohol and opioid abuse, high cholesterol, hypertension, overweight or obesity, and a rheumatoid arthritis diagnosis. On other measures, such as heart failure and diabetes diagnosis, a higher proportion of those receiving warfarin report these conditions. The last column of Table 1 reports the P values for a t-test for the equality of means or proportions. In general, there is a significant difference in the measures between the apixaban and warfarin groups. The only exception is the renal failure measure for which there is no statistically significant difference between the apixaban and warfarin groups.
The PS model estimates the likelihood of receiving apixaban as a treatment. It is estimated via a logistic regression model in which an indicator variable denoting apixaban use was regressed on 15 variables. These include age, employment status, source of health insurance coverage, sex, relationship to the insurance holder, and state of residence. A comprehensive list of the variables included in the model is shown in Table 1. We added 2-way interactions of these variables along with second-order polynomials for age to the model. The estimated PS showed common support for both treatments ranging from 0.003 to 0.996. The PS values were capped to lie between 0.05 and 0.95 for our main analysis to generate weights without high leverage. However, we carried out a robustness analysis without capping, and the results were similar to those presented here.
Table 2 lists the comparison of the standardized mean differences between treatments for each of the covariates in the observed data and after inverse probability weighting (ATE weights) by the estimated PSs. A general rule of thumb to identify good balancing is to see whether the standardized mean differences are below 0.1 for each covariate. [27] In this case, we achieved a good balance for each of the covariates after weighting.
Estimates generated via PS matching, stratification, inverse probability of treatment weights, and the doubly robust method are presented in Table 3. In each case, the estimates for the ATE and the ATT are calculated. The accompanying SEs are obtained via bootstrap. Differences between ATE and ATT are also presented, along with their bootstrapped SEs, which account for the correlation between these parameters.
The results indicate a decrease in the likelihood of stroke for patients receiving apixaban compared with warfarin. However, there is a difference in the estimates for the ATT and ATE. The results from the ATE models show a 2.9-3.0 percentage point decrease in the likelihood of a stroke if all individuals are receiving apixaban as opposed to warfarin. The estimates of the ATT show a 3.3-3.5 percentage point decrease, which implies that apixaban reduced the likelihood of stroke by this magnitude for those who received the drug, compared with warfarin.
It is important to note that the difference between the ATE and ATT estimates is statistically significant (Table 3). Substantively, ATTs are 13%-20% different when compared with ATEs. In this case, the prescription of interest has a larger impact on the treated group than it would have on the general population. This difference has 2 implications. First, there is treatment effect heterogeneity in this population. Second, there is a positive self-selection in practice in which prescribing patterns follow some of this effect heterogeneity even if they have not been formally established in RCTs. [28] Either of these estimates will be more appropriate depending on the inquiry's context and interest.

LIMITATIONS
PS methods provide a robust set of tools to compare outcomes across multiple cohorts when the intent is to balance the distribution of confounders observed in the data at hand across the cohorts. They do not, under any circumstances, address issues related to adjusting for confounders that are not observed in the dataset at hand. In that sense, they cannot truly mimic a randomized comparative effectiveness study. However, with no unobserved confounders (an assumption also termed exogeneity, selection on observables, or unconfoundedness), PS methods can be used to estimate the population mean treatment effect parameters. There are several checks and diagnostics that analysts should apply to ensure the appropriateness of applying PSs. We have provided several references to this end. However, this paper is about choosing the right PS approach to answer a decision-relevant question.

CONCLUSIONS
We discussed various ways PS methods can be applied and also that more than 1 mean treatment effect parameter can be estimated using the PS methods. Selecting which of these parameters should be the target for analysis can be determined by aligning the interpretation of the target parameter with the scientific questions at hand. Following this selection, a proper approach can be chosen to use the estimated PS to estimate this selected parameter. An illustration of these methods and the importance of choosing the appropriate target parameter was provided through a case study comparing the administration of apixaban vs warfarin on stroke outcomes among patients with atrial fibrillation. An important point of this case study was to show that the results from any of the PS approaches are very similar as long as we make sure we are identifying the same parameter. Therefore, it will be helpful for researchers presenting results using PS methods to articulate the specific target parameter they have in mind for a decision-relevant question and whether their chosen approach can identify that particular parameter.
We hope that these discussions and the illustration of our empirical example can help applied researchers in the health economics and outcomes research space to use these methods correctly.

DISCLOSURES
Dr Unuigbe's time was supported through an unrestricted postdoctoral fellowship from Pfizer to the University of Washington, Seattle.